11 research outputs found

    LSTM Pose Machines

    Full text link
    We observed that recent state-of-the-art results on single-image human pose estimation were achieved by multi-stage Convolutional Neural Networks (CNNs). Notwithstanding the superior performance on static images, applying these models to videos is not only computationally intensive but also suffers from performance degradation and flickering. Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency, to handle severe image quality degradation (e.g. motion blur and occlusion), and to capture the temporal correlation among video frames. In this paper, we proposed a novel recurrent network to tackle these problems. We showed that if we impose a weight-sharing scheme on the multi-stage CNN, it can be rewritten as a Recurrent Neural Network (RNN). This property decouples the relationship among multiple network stages and results in significantly faster speed when invoking the network on videos. It also enables the adoption of Long Short-Term Memory (LSTM) units between video frames. We found that such a memory-augmented RNN is very effective in imposing geometric consistency among frames. It also handles input quality degradation in videos well while successfully stabilizing the sequential outputs. The experiments showed that our approach significantly outperformed current state-of-the-art methods on two large-scale video pose estimation benchmarks. We also explored the memory cells inside the LSTM and provided insights on why such a mechanism benefits the prediction for video-based pose estimation.
    Comment: Poster in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
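
    The abstract's core claim is architectural: a weight-shared multi-stage CNN unrolls into a recurrent network, and an LSTM placed between frames carries geometric consistency forward in time. A minimal PyTorch sketch of that idea follows; it is not the authors' code, and the convolutional LSTM cell, toy encoder, joint count and channel sizes are all illustrative assumptions.

        import torch
        import torch.nn as nn

        class ConvLSTMCell(nn.Module):
            def __init__(self, in_ch, hid_ch):
                super().__init__()
                # One convolution produces the input, forget, output and candidate gates.
                self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)

            def forward(self, x, h, c):
                i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
                c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
                h = torch.sigmoid(o) * torch.tanh(c)
                return h, c

        class RecurrentPoseNet(nn.Module):
            def __init__(self, n_joints=14, feat_ch=32):
                super().__init__()
                self.feat_ch = feat_ch
                # Weight-shared per-frame feature extractor (plays the role of one CNN stage).
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
                self.lstm = ConvLSTMCell(feat_ch, feat_ch)
                self.head = nn.Conv2d(feat_ch, n_joints, 1)      # per-joint heatmaps

            def forward(self, frames):                           # frames: (B, T, 3, H, W)
                b, t, _, hgt, wid = frames.shape
                h = frames.new_zeros(b, self.feat_ch, hgt, wid)
                c = torch.zeros_like(h)
                heatmaps = []
                for ti in range(t):
                    feat = self.encoder(frames[:, ti])
                    h, c = self.lstm(feat, h, c)                 # memory carries temporal consistency
                    heatmaps.append(self.head(h))
                return torch.stack(heatmaps, dim=1)              # (B, T, n_joints, H, W)

        print(RecurrentPoseNet()(torch.randn(1, 4, 3, 64, 64)).shape)  # (1, 4, 14, 64, 64)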

    RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs

    Full text link
    Blind face restoration aims at recovering high-quality face images from those with unknown degradations. Current algorithms mainly introduce priors to complement high-quality details and achieve impressive progress. However, most of these algorithms ignore abundant contextual information in the face and its interplay with the priors, leading to sub-optimal performance. Moreover, they pay less attention to the gap between synthetic and real-world scenarios, limiting robustness and generalization in real-world applications. In this work, we propose RestoreFormer++, which, on the one hand, introduces fully-spatial attention mechanisms to model the contextual information and its interplay with the priors and, on the other hand, explores an extending degrading model to help generate more realistic degraded face images and thus alleviate the synthetic-to-real-world gap. Compared with current algorithms, RestoreFormer++ has several crucial benefits. First, instead of using a multi-head self-attention mechanism like the traditional visual transformer, we introduce multi-head cross-attention over multi-scale features to fully explore spatial interactions between corrupted information and high-quality priors. In this way, RestoreFormer++ can restore face images with higher realness and fidelity. Second, in contrast to the recognition-oriented dictionary, we learn a reconstruction-oriented dictionary as priors, which contains more diverse high-quality facial details and better accords with the restoration target. Third, we introduce an extending degrading model that contains more realistic degraded scenarios for training data synthesis, which helps to enhance the robustness and generalization of our RestoreFormer++ model. Extensive experiments show that RestoreFormer++ outperforms state-of-the-art algorithms on both synthetic and real-world datasets.
    Comment: Submitted to TPAMI. An extension of RestoreForme
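
    The first benefit described above, multi-head cross-attention between corrupted features and high-quality priors, can be sketched compactly. The snippet below is a hedged illustration under assumed shapes: a learnable reconstruction-oriented dictionary acts as keys and values while the degraded face features act as queries. The dictionary size, feature dimension, residual fusion and the class name DictionaryCrossAttention are assumptions, not the paper's implementation.

        import torch
        import torch.nn as nn

        class DictionaryCrossAttention(nn.Module):
            def __init__(self, dim=256, dict_size=1024, heads=8):
                super().__init__()
                # Learned dictionary of high-quality facial priors (reconstruction-oriented).
                self.dictionary = nn.Parameter(torch.randn(dict_size, dim))
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm = nn.LayerNorm(dim)

            def forward(self, degraded_feat):                    # (B, H*W, dim) flattened features
                b = degraded_feat.size(0)
                prior = self.dictionary.unsqueeze(0).expand(b, -1, -1)
                # Queries: corrupted features; keys/values: high-quality priors.
                fused, _ = self.attn(query=degraded_feat, key=prior, value=prior)
                return self.norm(degraded_feat + fused)          # residual fusion

        feat = torch.randn(2, 16 * 16, 256)                      # e.g. a flattened 16x16 feature map
        print(DictionaryCrossAttention()(feat).shape)            # torch.Size([2, 256, 256])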

    StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

    Full text link
    This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass. Unlike existing methods that rely on training a separate LoRA for each style, our method can adapt to various styles with a unified model. However, this poses two challenges: 1) the prompt loses controllability over the generated content, and 2) the output image inherits both the semantic and style features of the style reference image, compromising its content fidelity. To address these challenges, we introduce StyleAdapter, a model that comprises two components: a two-path cross-attention module (TPCA) and three decoupling strategies. These components enable our model to process the prompt and style reference features separately and reduce the strong coupling between the semantic and style information in the style references. StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods. Experiments have been conducted to demonstrate the superiority of our method over previous works.
    Comment: AIG
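
    As a rough illustration of the two-path cross-attention (TPCA) idea, the sketch below lets the generator's hidden states attend separately to text-prompt embeddings and style-reference embeddings and blends the two paths with a learnable weight. The dimensions, the gating scheme and the class name TwoPathCrossAttention are assumptions for illustration; the three decoupling strategies are not modeled here.

        import torch
        import torch.nn as nn

        class TwoPathCrossAttention(nn.Module):
            def __init__(self, dim=320, ctx_dim=768, heads=8):
                super().__init__()
                self.text_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
                self.style_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
                self.gate = nn.Parameter(torch.tensor(0.5))      # learnable blend of the style path

            def forward(self, hidden, text_emb, style_emb):
                # hidden: (B, N, dim); text_emb / style_emb: (B, L, ctx_dim)
                from_text, _ = self.text_attn(hidden, text_emb, text_emb)
                from_style, _ = self.style_attn(hidden, style_emb, style_emb)
                return hidden + from_text + self.gate * from_style

        h = torch.randn(1, 64, 320)                              # generator hidden states
        t = torch.randn(1, 77, 768)                              # text-prompt embeddings
        s = torch.randn(1, 16, 768)                              # style-reference embeddings
        print(TwoPathCrossAttention()(h, t, s).shape)            # torch.Size([1, 64, 320])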

    Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection

    Full text link
    Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of the object-level visual features. In this paper, we propose a novel and efficient deep framework that boosts multi-label classification by distilling knowledge from a weakly-supervised detection task without bounding box annotations. Specifically, given only image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module that guides the classification model with the WSD model, according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved and its efficiency is maintained, since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework surpasses state-of-the-art methods in both performance and efficiency.
    Comment: accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 table
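
    A hedged sketch of the class-level part of such cross-task distillation: the student's multi-label logits are fit to the ground-truth labels while also being pulled towards the frozen WSD teacher's tempered per-class predictions. The temperature, loss weighting and function signature are illustrative assumptions rather than the paper's exact objective (which also distills object-level RoI features).

        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
            # Multi-label ground-truth term (binary cross-entropy per class).
            hard = F.binary_cross_entropy_with_logits(student_logits, labels)
            # Soft-target term: match the teacher's tempered per-class probabilities.
            soft = F.binary_cross_entropy(
                torch.sigmoid(student_logits / T),
                torch.sigmoid(teacher_logits / T).detach())
            return alpha * hard + (1.0 - alpha) * soft * T * T

        student = torch.randn(4, 80)                     # e.g. 80 MS-COCO classes
        teacher = torch.randn(4, 80)                     # frozen WSD model's image-level outputs
        labels = torch.randint(0, 2, (4, 80)).float()
        print(distillation_loss(student, teacher, labels))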

    Image Deblurring Aided by Low-Resolution Events

    No full text
    Due to the limitations of event sensors, the spatial resolution of event data is relatively low compared to that of conventional frame-based cameras. However, the low-spatial-resolution events recorded by event cameras are rich in temporal information, which is helpful for image deblurring, while the intensity images captured by frame cameras are in high resolution and have the potential to enhance the quality of the events. Considering the complementarity between events and intensity images, an alternating model is proposed in this paper to deblur high-resolution images with the help of low-resolution events. The model is composed of two components: a DeblurNet and an EventSRNet. It first uses the DeblurNet to obtain a preliminary sharp image aided by the low-resolution events. Then, it enhances the quality of the events with the EventSRNet by extracting structure information from the generated sharp image. Finally, the enhanced events are sent back into the DeblurNet to obtain a higher-quality intensity image. Extensive evaluations on the synthetic GoPro dataset and the real RGB-DAVIS dataset have shown the effectiveness of the proposed method.
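
    The alternating DeblurNet / EventSRNet loop described above can be written down in a few lines. In the sketch below both sub-networks are stand-in convolutional blocks and the event representation is an assumed five-channel voxel grid; only the alternation pattern (deblur, enhance events, deblur again) follows the abstract.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def tiny_block(in_ch, out_ch):
            # Placeholder for the real sub-networks.
            return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, out_ch, 3, padding=1))

        class AlternatingDeblur(nn.Module):
            def __init__(self, event_ch=5):
                super().__init__()
                self.deblur_net = tiny_block(3 + event_ch, 3)            # blurry RGB + events -> sharp RGB
                self.event_sr_net = tiny_block(3 + event_ch, event_ch)   # sharp RGB + events -> enhanced events

            def forward(self, blurry, lr_events):
                # Upsample the low-resolution events to the image resolution first.
                events = F.interpolate(lr_events, size=blurry.shape[-2:], mode='bilinear', align_corners=False)
                sharp = self.deblur_net(torch.cat([blurry, events], dim=1))      # step 1: preliminary sharp image
                events = self.event_sr_net(torch.cat([sharp, events], dim=1))    # step 2: enhance the events
                return self.deblur_net(torch.cat([blurry, events], dim=1))       # step 3: deblur again

        img = torch.randn(1, 3, 128, 128)
        ev = torch.randn(1, 5, 32, 32)                   # low-resolution event voxel grid
        print(AlternatingDeblur()(img, ev).shape)        # torch.Size([1, 3, 128, 128])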

    UNet-ESPC-Cascaded Super-Resolution Reconstruction in Spectral CT

    No full text
    Spectral CT based on photon-counting detectors is a promising imaging modality since it provides the possibility of both obtaining CT images from multiple energy bins with a single X-ray exposure and allowing low-dose imaging. However, image quality, such as the spatial resolution of images reconstructed from multiple energy bins, is degraded because of the use of narrow energy bins in spectral CT. We propose to use deep learning methods for super-resolution reconstruction of spectral CT images. To this end, we introduce a UNet-ESPC-cascaded model and perform patch-based training to obtain the optimal parameters of the model. Experimental results on physical phantom datasets demonstrated that our deep-learning-based reconstruction method can reduce the F-norm error between the reconstructed super-resolution CT image and the ground truth by 11.6% and 5.66% with respect to bilinear-interpolation-based reconstruction and iterative back-projection methods, respectively. Our method achieves the best results with a patch size of 20 and a stride of 15.
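
    A rough sketch of the UNet-ESPC cascade named in the abstract: a small UNet-style residual refiner followed by an efficient sub-pixel convolution (ESPC, i.e. a PixelShuffle upscaler), trained on patches. The channel counts, network depth and the 2x upscaling factor are illustrative assumptions.

        import torch
        import torch.nn as nn

        class TinyUNet(nn.Module):
            def __init__(self, ch=32):
                super().__init__()
                self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
                self.dec = nn.Sequential(nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(),
                                         nn.Conv2d(ch, 1, 3, padding=1))

            def forward(self, x):
                return x + self.dec(self.enc(x))         # residual refinement at the input resolution

        class ESPC(nn.Module):
            def __init__(self, scale=2, ch=32):
                super().__init__()
                self.body = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                          nn.Conv2d(ch, scale * scale, 3, padding=1),
                                          nn.PixelShuffle(scale))    # sub-pixel convolution upscaling

            def forward(self, x):
                return self.body(x)

        cascade = nn.Sequential(TinyUNet(), ESPC(scale=2))
        patch = torch.randn(8, 1, 20, 20)                # patch-based training, e.g. 20x20 patches
        print(cascade(patch).shape)                      # torch.Size([8, 1, 40, 40])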